feat(grpc): enforce a maximum number of reconnection attempts#902
feat(grpc): enforce a maximum number of reconnection attempts#902Molter73 wants to merge 1 commit into
Conversation
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yml Review profile: CHILL Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #902 +/- ##
==========================================
+ Coverage 32.72% 34.35% +1.62%
==========================================
Files 21 21
Lines 2695 2771 +76
Branches 2695 2771 +76
==========================================
+ Hits 882 952 +70
- Misses 1810 1816 +6
Partials 3 3 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
If fact fails to connect to the gRPC server in the specified number of attemtps it will crash, which can be used as a clear signal in k8s that something is not working as expected. If the number of retries is set to 0, there will be no limit to the amount of times fact attempts to reconnect. In order to allow the reconnection failure to trigger an application wide crash, the main output task monitors the result of the grpc task's handle propagating the error further up. This method should allow for other output components to be added in the future and follow this same pattern without having to change anything outside the output module. The stdout component is not included in this logic because it has no condition that could merit an application wide crash. We also now propagate the error of worker tasks all the way up to the termination of the application.
a6f8c0e to
669f5be
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
fact/src/output/mod.rs (1)
47-74: 📐 Maintainability & Code Quality | 🔵 Trivial | 🏗️ Heavy liftAdd focused coverage for gRPC supervision failures.
This loop is the new contract that converts
Ok(Err(_))and join failures into output-level errors, but the PR coverage report shows this file has no patch coverage. A small async test or extracted helper around thematch respath would protect the shutdown behavior.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@fact/src/output/mod.rs` around lines 47 - 74, Add focused test coverage for the gRPC supervision path in the output loop so shutdown/error handling is exercised. The `tokio::spawn` block in `output::mod` currently maps `grpc_handle.borrow_mut()` results into output-level failures, so add a small async test or extract a helper around the `match res` branch to verify both `Ok(Err(_))` and join/task errors are handled as expected.fact/src/config/tests.rs (1)
567-582: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAdd the negative retries parse-error case.
Once the parser rejects negative retry counts, cover
retries: -1here so YAML validation matches the new retry-budget contract.Proposed test case
( r#" grpc: backoff: retries: true "#, "invalid grpc.backoff.retries: Boolean(true)", ), + ( + r#" + grpc: + backoff: + retries: -1 + "#, + "invalid grpc.backoff.retries: Integer(-1)", + ),As per coding guidelines,
fact/src/config/mod.rs: "Add unit tests in fact/src/config/tests.rs for configuration schema changes in fact/src/config/mod.rs."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@fact/src/config/tests.rs` around lines 567 - 582, The config parsing tests in tests.rs are missing coverage for the new negative retry validation in grpc.backoff.retries. Add a unit test case alongside the existing retries parse-error assertions in the config tests to verify that retries: -1 is rejected with the appropriate invalid grpc.backoff.retries error, using the same pattern as the current retries validation cases.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@fact/src/config/mod.rs`:
- Around line 416-420: Reject negative YAML retry counts in the retries parsing
branch of the config loader, since converting a signed value to u64 can wrap and
bypass the intended retry limit. Update the logic in the backoff/retries
handling inside mod.rs to accept 0 and positive values only, and explicitly bail
out when v.as_i64() returns a value below zero before assigning to
backoff.retries_max. Also add unit coverage in fact/src/config/tests.rs for the
retries configuration path to verify negative values fail and zero remains
valid.
In `@fact/src/output/grpc.rs`:
- Around line 217-220: The reconnect exhaustion path in gRPC output handling
drops the last connection failure, so update the backoff termination branch in
the connection retry loop to preserve the final error details when reconnection
attempts are exhausted. In the logic around the backoff handling in the gRPC
connection code, make the terminal bail/error include the original connection
error value from the retry loop so callers can see the actual
DNS/TLS/refused-connection cause instead of only “attempts exhausted”.
---
Nitpick comments:
In `@fact/src/config/tests.rs`:
- Around line 567-582: The config parsing tests in tests.rs are missing coverage
for the new negative retry validation in grpc.backoff.retries. Add a unit test
case alongside the existing retries parse-error assertions in the config tests
to verify that retries: -1 is rejected with the appropriate invalid
grpc.backoff.retries error, using the same pattern as the current retries
validation cases.
In `@fact/src/output/mod.rs`:
- Around line 47-74: Add focused test coverage for the gRPC supervision path in
the output loop so shutdown/error handling is exercised. The `tokio::spawn`
block in `output::mod` currently maps `grpc_handle.borrow_mut()` results into
output-level failures, so add a small async test or extract a helper around the
`match res` branch to verify both `Ok(Err(_))` and join/task errors are handled
as expected.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Enterprise
Run ID: 81043491-08dc-4da7-a651-5e8569ee9b26
📒 Files selected for processing (5)
fact/src/config/mod.rsfact/src/config/tests.rsfact/src/lib.rsfact/src/output/grpc.rsfact/src/output/mod.rs
| "retries" => { | ||
| let Some(retries) = v.as_i64() else { | ||
| bail!("invalid grpc.backoff.retries: {v:?}"); | ||
| }; | ||
| backoff.retries_max = Some(retries as u64); |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Reject negative YAML retry counts before casting.
Line 420 turns values like retries: -1 into a huge u64, effectively bypassing the configured retry budget instead of failing configuration parsing. Keep 0 valid, but reject values below zero.
Proposed fix
"retries" => {
- let Some(retries) = v.as_i64() else {
+ let Some(retries) = v.as_i64().filter(|retries| *retries >= 0) else {
bail!("invalid grpc.backoff.retries: {v:?}");
};
backoff.retries_max = Some(retries as u64);
}As per coding guidelines, fact/src/config/mod.rs: "Add unit tests in fact/src/config/tests.rs for configuration schema changes in fact/src/config/mod.rs."
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "retries" => { | |
| let Some(retries) = v.as_i64() else { | |
| bail!("invalid grpc.backoff.retries: {v:?}"); | |
| }; | |
| backoff.retries_max = Some(retries as u64); | |
| "retries" => { | |
| let Some(retries) = v.as_i64().filter(|retries| *retries >= 0) else { | |
| bail!("invalid grpc.backoff.retries: {v:?}"); | |
| }; | |
| backoff.retries_max = Some(retries as u64); |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@fact/src/config/mod.rs` around lines 416 - 420, Reject negative YAML retry
counts in the retries parsing branch of the config loader, since converting a
signed value to u64 can wrap and bypass the intended retry limit. Update the
logic in the backoff/retries handling inside mod.rs to accept 0 and positive
values only, and explicitly bail out when v.as_i64() returns a value below zero
before assigning to backoff.retries_max. Also add unit coverage in
fact/src/config/tests.rs for the retries configuration path to verify negative
values fail and zero remains valid.
Source: Coding guidelines
| let Some(delay) = backoff.next() else { | ||
| bail!("Failed to connect to server: reconnection attempts exhausted"); | ||
| }; | ||
| info!("Failed to connect to server: {e:?}, retrying in {delay:?}"); |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Preserve the final connection error on exhaustion.
Line 218 returns only “attempts exhausted”, so the terminal error can hide whether the last failure was DNS, TLS, refused connection, etc.
Proposed fix
let Some(delay) = backoff.next() else {
- bail!("Failed to connect to server: reconnection attempts exhausted");
+ bail!(
+ "Failed to connect to server: reconnection attempts exhausted after last error: {e:?}"
+ );
};📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| let Some(delay) = backoff.next() else { | |
| bail!("Failed to connect to server: reconnection attempts exhausted"); | |
| }; | |
| info!("Failed to connect to server: {e:?}, retrying in {delay:?}"); | |
| let Some(delay) = backoff.next() else { | |
| bail!( | |
| "Failed to connect to server: reconnection attempts exhausted after last error: {e:?}" | |
| ); | |
| }; | |
| info!("Failed to connect to server: {e:?}, retrying in {delay:?}"); |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@fact/src/output/grpc.rs` around lines 217 - 220, The reconnect exhaustion
path in gRPC output handling drops the last connection failure, so update the
backoff termination branch in the connection retry loop to preserve the final
error details when reconnection attempts are exhausted. In the logic around the
backoff handling in the gRPC connection code, make the terminal bail/error
include the original connection error value from the retry loop so callers can
see the actual DNS/TLS/refused-connection cause instead of only “attempts
exhausted”.
| }, | ||
| ..Default::default() | ||
| }, | ||
| ), |
There was a problem hiding this comment.
Add some tests with negative retries.
Description
If fact fails to connect to the gRPC server in the specified number of attemtps it will crash, which can be used as a clear signal in k8s that something is not working as expected. If the number of retries is set to 0, there will be no limit to the amount of times fact attempts to reconnect.
In order to allow the reconnection failure to trigger an application wide crash, the main output task monitors the result of the grpc task's handle propagating the error further up. This method should allow for other output components to be added in the future and follow this same pattern without having to change anything outside the output module. The stdout component is not included in this logic because it has no condition that could merit an application wide crash.
Checklist
Automated testing
If any of these don't apply, please comment below.
Testing Performed
Pointed fact to connect to a non-existant gRPC server and got it to crash as expected.
fact output on crash
Summary by CodeRabbit
New Features
Bug Fixes